Batch and Online Spam Filter Comparison
نویسندگان
چکیده
In the TREC 2005 Spam Evaluation Track, a number of popular spam filters – all owing their heritage to Graham’s A Plan for Spam – did quite well. Machine learning techniques reported elsewhere to perform well were hardly represented in the participating filters, and not represented at all in the better results. A non-traditional technique Prediction by Partial Matching (PPM) – performed exceptionally well, at or near the top of every test. Are the TREC results an anomaly? Is PPM really the best method for spam filtering? How are these results to be reconciled with others showing that methods like Support Vector Machines (SVM) are superior? We address these issues by testing implementations of five different classification methods on the TREC public corpus using the online evaluation methodology introduced in TREC. These results are complemented with cross validation experiments, which facilitate a comparison of the methods considered in the study under different evaluation schemes, and also give insight into the nature and utility of the evaluation regimens themselves. For comparison with previously published results, we also conducted cross validation experiments on the Ling-Spam and PU1 datasets. These tests reveal substantial differences attributable to different test assumptions, in particular batch vs. on-line training and testing, the order of classification, and the method of tokenization. Notwithstanding these differences, the methods that perform well at TREC also perform well using established test methods and corpora. Two previously untested methods – one based on Dynamic Markov Compression and one using logistic regression – compare favorably with competing approaches.
منابع مشابه
Learning and Detecting Concept Drift
The volume of data that humans create has increased explosively as information science and technology have evolved. Therefore, the demand for learning machines that can extract input-output mappings and knowledge rules from massive data sets has become more urgent, and machine learning is now a core technology in the advanced information society. It has been applied to fields such as pattern re...
متن کاملFeature-based Malicious URL and Attack Type Detection Using Multi-class Classification
Nowadays, malicious URLs are the common threat to the businesses, social networks, net-banking etc. Existing approaches have focused on binary detection i.e. either the URL is malicious or benign. Very few literature is found which focused on the detection of malicious URLs and their attack types. Hence, it becomes necessary to know the attack type and adopt an effective countermeasure. This pa...
متن کاملNot So Naive Online Bayesian Spam Filter
Spam filtering, as a key problem in electronic communication, has drawn significant attention due to increasingly huge amounts of junk email on the Internet. Content-based filtering is one reliable method in combating with spammers changing tactics. Naı̈ve Bayes (NB) is one of the earliest content-based machine learning methods both in theory and practice in combating with spammers, which is eas...
متن کاملSpamCooling: A Parallel Heterogeneous Ensemble Spam Filtering System Based on Active Learning Techniques
Anti-spam technology is developing rapidly in recent years. With the emerging applications of machine learning in diverse fields, researchers as well as manufacturers around the world have attempted a large number of related algorithms to prevent spam. In this paper, we designed an effective anti-spam protection system, SpamCooling, based on the mechanism of active learning and parallel heterog...
متن کاملNot So Naı̈ve Online Bayesian Spam Filter
Spam filtering, as a key problem in electronic communication, has drawn significant attention due to increasingly huge amounts of junk email on the Internet. Content-based filtering is one reliable method in combating with spammers’ changing tactics. Naı̈ve Bayes (NB) is one of the earliest content-based machine learning methods both in theory and practice in combating with spammers, which is ea...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2006